{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "multiprocessing_rl.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/master/multiprocessing_rl.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hyyN-2qyK_T2",
        "colab_type": "text"
      },
      "source": [
        "# Stable Baselines, a fork of OpenAI Baselines - Easy Multiprocessing\n",
        "\n",
        "GitHub Repo: [https://github.com/hill-a/stable-baselines](https://github.com/hill-a/stable-baselines)\n",
        "\n",
        "Medium article: [https://medium.com/@araffin/stable-baselines-a-fork-of-openai-baselines-df87c4b2fc82](https://medium.com/@araffin/stable-baselines-a-fork-of-openai-baselines-df87c4b2fc82)\n",
        "\n",
        "[RL Baselines Zoo](https://github.com/araffin/rl-baselines-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines.\n",
        "\n",
        "It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.\n",
        "\n",
        "Documentation is available online: [https://stable-baselines.readthedocs.io/](https://stable-baselines.readthedocs.io/)\n",
        "\n",
        "## Install Dependencies and Stable Baselines Using Pip\n",
        "\n",
        "The full list of dependencies can be found in the [README](https://github.com/hill-a/stable-baselines).\n",
        "\n",
        "```\n",
        "sudo apt-get update && sudo apt-get install cmake libopenmpi-dev zlib1g-dev\n",
        "```\n",
        "\n",
        "\n",
        "```\n",
        "pip install stable-baselines[mpi]\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "503Gi2076F7u",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Stable Baselines only supports tensorflow 1.x for now\n",
        "%tensorflow_version 1.x\n",
        "!apt install swig cmake libopenmpi-dev zlib1g-dev\n",
        "!pip install stable-baselines[mpi]==2.10.2"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FtY8FhliLsGm",
        "colab_type": "text"
      },
      "source": [
        "## Import policy, RL agent, ..."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "BIedd7Pz9sOs",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import time\n",
        "\n",
        "import gym\n",
        "import numpy as np\n",
        "\n",
        "from stable_baselines import ACKTR\n",
        "from stable_baselines.common.vec_env import DummyVecEnv, SubprocVecEnv\n",
        "from stable_baselines.common import set_global_seeds\n",
        "from stable_baselines.common.evaluation import evaluate_policy\n",
        "from stable_baselines.common.cmd_util import make_vec_env"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "t5WNF6G5gWZ1",
        "colab_type": "text"
      },
      "source": [
        "## Multiprocessing RL Training\n",
        "\n",
        "To multiprocess RL training, we just have to wrap the Gym env in a `SubprocVecEnv` object, which takes care of synchronising the processes. The idea is that each process runs an independent instance of the Gym env.\n",
        "\n",
        "For that, we need an additional utility function, `make_env`, that instantiates the environments and makes sure they differ by giving each one a different random seed."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TgjfyOTPVxG6",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def make_env(env_id, rank, seed=0):\n",
        "    \"\"\"\n",
        "    Utility function for multiprocessed env.\n",
        "    \n",
        "    :param env_id: (str) the environment ID\n",
        "    :param rank: (int) index of the subprocess\n",
        "    :param seed: (int) the initial seed for RNG\n",
        "    :return: (callable) a function that creates the seeded environment\n",
        "    \"\"\"\n",
        "    def _init():\n",
        "        env = gym.make(env_id)\n",
        "        env.seed(seed + rank)\n",
        "        return env\n",
        "    set_global_seeds(seed)\n",
        "    return _init"
      ],
      "execution_count": 0,
      "outputs": []
    },
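    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "To see why `make_env` offsets the seed by `rank`, here is a minimal standard-library sketch. It is an illustration only, not Stable Baselines code: `make_rng` and `sample_noise` are hypothetical helpers, and Python's `random.Random` stands in for an environment's internal random number generator.\n",
        "\n",
        "```python\n",
        "import random\n",
        "\n",
        "\n",
        "def make_rng(rank, seed=0):\n",
        "    # Hypothetical helper: one generator per worker, offset like `seed + rank`.\n",
        "    return random.Random(seed + rank)\n",
        "\n",
        "\n",
        "def sample_noise(rng, n_steps=3):\n",
        "    # Stand-in for the randomness an environment would consume.\n",
        "    return [rng.random() for _ in range(n_steps)]\n",
        "\n",
        "\n",
        "num_cpu = 4\n",
        "streams = [sample_noise(make_rng(rank)) for rank in range(num_cpu)]\n",
        "\n",
        "# Every worker sees a different stream of random numbers.\n",
        "print(len({tuple(stream) for stream in streams}) == num_cpu)\n",
        "# The same (seed, rank) pair always reproduces the same stream.\n",
        "print(sample_noise(make_rng(2)) == streams[2])\n",
        "```\n",
        "\n",
        "Had every worker used the same seed, all four streams would be identical."
      ]
    },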
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iPGIySi2g_RN",
        "colab_type": "text"
      },
      "source": [
        "The number of parallel processes is set by the `num_cpu` variable.\n",
        "\n",
        "Because we use a vectorized environment (`SubprocVecEnv`), the actions sent to the wrapped env must be an array (one action per process). Likewise, the observations, rewards and dones it returns are arrays."
      ]
    },
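    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The batched interface described above can be sketched with a toy class. This is a hypothetical stand-in written in plain Python, not the Stable Baselines implementation: `ToyVecEnv`, its `horizon` parameter and its reward rule are assumptions made for illustration, and the per-environment `infos` that a real `step` call also returns are omitted.\n",
        "\n",
        "```python\n",
        "class ToyVecEnv:\n",
        "    # Hypothetical vectorized env: n_envs counters stepped together.\n",
        "\n",
        "    def __init__(self, n_envs, horizon=3):\n",
        "        self.n_envs = n_envs\n",
        "        self.horizon = horizon     # each toy episode lasts `horizon` steps\n",
        "        self.steps = [0] * n_envs  # per-environment step counter\n",
        "\n",
        "    def reset(self):\n",
        "        self.steps = [0] * self.n_envs\n",
        "        return list(self.steps)    # one observation per environment\n",
        "\n",
        "    def step(self, actions):\n",
        "        # The batched interface expects exactly one action per environment.\n",
        "        if len(actions) != self.n_envs:\n",
        "            raise ValueError('expected one action per environment')\n",
        "        obs, rewards, dones = [], [], []\n",
        "        for i, action in enumerate(actions):\n",
        "            self.steps[i] += 1\n",
        "            done = self.steps[i] >= self.horizon\n",
        "            if done:\n",
        "                self.steps[i] = 0  # restart only the finished environment\n",
        "            obs.append(self.steps[i])\n",
        "            rewards.append(float(action))  # toy rule: reward equals action\n",
        "            dones.append(done)\n",
        "        return obs, rewards, dones\n",
        "\n",
        "\n",
        "num_cpu = 4\n",
        "toy_env = ToyVecEnv(num_cpu)\n",
        "obs = toy_env.reset()\n",
        "obs, rewards, dones = toy_env.step([0, 1, 0, 1])\n",
        "print(len(obs), len(rewards), len(dones))\n",
        "```\n",
        "\n",
        "Every returned list has `num_cpu` entries, one per environment, and passing a single action instead of four raises an error in this sketch."
      ]
    },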
    {
      "cell_type": "code",
      "metadata": {
        "id": "pUWGZp3i9wyf",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "env_id = \"CartPole-v1\"\n",
        "num_cpu = 4  # Number of processes to use\n",
        "# Create the vectorized environment\n",
        "env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])\n",
        "\n",
        "model = ACKTR('MlpPolicy', env, verbose=0)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yPtIY0dR6ssd",
        "colab_type": "text"
      },
      "source": [
        "Stable-Baselines provides a `make_vec_env()` helper that performs the previous steps for you (by default with a `DummyVecEnv` rather than a `SubprocVecEnv`):"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6TYDgQHz6sIV",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# By default, we use a DummyVecEnv as it is usually faster (cf doc)\n",
        "vec_env = make_vec_env(env_id, n_envs=num_cpu)\n",
        "\n",
        "model = ACKTR('MlpPolicy', vec_env, verbose=0)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zjEVOIY8NVeK",
        "colab_type": "text"
      },
      "source": [
        "Let's evaluate the untrained agent; it should behave like a random agent."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xDHLMA6NFk95",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# We create a separate environment for evaluation\n",
        "eval_env = gym.make(env_id)\n",
        "\n",
        "# Random Agent, before training\n",
        "mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)\n",
        "print(f'Mean reward: {mean_reward} +/- {std_reward:.2f}')"
      ],
      "execution_count": 0,
      "outputs": []
    },
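    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The two numbers printed above summarise the per-episode returns. As a minimal sketch, assuming the summary is the mean and the population standard deviation of those returns, the computation looks like this. `summarize_returns` is a hypothetical helper, and the returns below are made-up values, not results from this notebook.\n",
        "\n",
        "```python\n",
        "import statistics\n",
        "\n",
        "\n",
        "def summarize_returns(episode_returns):\n",
        "    # Mean and population standard deviation of per-episode returns.\n",
        "    mean_reward = statistics.mean(episode_returns)\n",
        "    std_reward = statistics.pstdev(episode_returns)\n",
        "    return mean_reward, std_reward\n",
        "\n",
        "\n",
        "# Made-up returns for 10 evaluation episodes (illustration only).\n",
        "episode_returns = [9.0, 10.0, 8.0, 9.0, 10.0, 11.0, 9.0, 8.0, 10.0, 9.0]\n",
        "mean_reward, std_reward = summarize_returns(episode_returns)\n",
        "print(f'Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}')\n",
        "```\n",
        "\n",
        "A small standard deviation relative to the mean indicates that the agent behaves consistently across episodes."
      ]
    },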
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "r5UoXTZPNdFE",
        "colab_type": "text"
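      },
      "source": [
        "## Multiprocess VS Single Process Training\n",
        "\n",
        "Here, we compare the time taken using four environments vs. one. It should take ~30s in total. Note that `model` was last re-created on the `DummyVecEnv` returned by `make_vec_env`, so its four environments are stepped sequentially in a single process; rebuild it on the `SubprocVecEnv` created earlier to time true multiprocessing."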
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "e4cfSXIB-pTF",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "n_timesteps = 25000\n",
        "\n",
        "# Multiprocessed RL Training\n",
        "start_time = time.time()\n",
        "model.learn(n_timesteps)\n",
        "total_time_multi = time.time() - start_time\n",
        "\n",
        "print(\"Took {:.2f}s for multiprocessed version - {:.2f} FPS\".format(total_time_multi, n_timesteps / total_time_multi))\n",
        "\n",
        "# Single Process RL Training\n",
        "single_process_model = ACKTR('MlpPolicy', env_id, verbose=0)\n",
        "\n",
        "start_time = time.time()\n",
        "single_process_model.learn(n_timesteps)\n",
        "total_time_single = time.time() - start_time\n",
        "\n",
        "print(\"Took {:.2f}s for single process version - {:.2f} FPS\".format(total_time_single, n_timesteps / total_time_single))\n",
        "\n",
        "print(\"Multiprocessed training is {:.2f}x faster!\".format(total_time_single / total_time_multi))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ygl_gVmV_QP7",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Evaluate the trained agent\n",
        "mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)\n",
        "print(f'Mean reward: {mean_reward} +/- {std_reward:.2f}')"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}
